# Visual Instruction Tuning

Llava MORE Llama 3 1 8B Finetuning

LLaVA-MORE is an enhanced variant of the LLaVA architecture that integrates LLaMA 3.1 as the language model and focuses on image-to-text tasks.

Instructblip Flan T5 Xl 8bit Nf4

InstructBLIP is a vision-instruction-tuned version of BLIP-2 that combines visual and language processing to generate responses from images and textual instructions.

Transformers English

Instructblip Flan T5 Xl 8bit Nf4

InstructBLIP is a vision-instruction-tuned model based on BLIP-2 that uses Flan-T5-xl as the language model and generates descriptions from images and text instructions.

Transformers English

Mediocreatmybest

Instructblip Flan T5 Xxl 8bit Nf4

InstructBLIP is the vision-instruction-tuned version of BLIP-2, combining vision and language models to generate descriptions or answer questions based on images and text instructions.

Transformers English

Mediocreatmybest

Instructblip Flan T5 Xl 8bit

InstructBLIP is the vision-instruction-tuned version of BLIP-2, based on the Flan-T5-xl language model, designed for image-to-text generation tasks.

Transformers English

Mediocreatmybest

Instructblip Vicuna 13b

InstructBLIP is the visual instruction-tuned version of BLIP-2, based on the Vicuna-13b language model, designed for vision-language tasks.

Transformers English

Instructblip Flan T5 Xxl

InstructBLIP is the vision-instruction-tuned version of BLIP-2, capable of generating descriptions or answers based on images and text instructions.

Transformers English

Instructblip Vicuna 7b

InstructBLIP is a vision-instruction-tuned version of BLIP-2 that uses Vicuna-7B as the language model and focuses on vision-language tasks.

Transformers English

© 2025 AIbase